Bank Marketing

1. Frame

A Portuguese Bank wants to run a direct marketing campaign to sell its new term deposit plan. The goal is to help them identify customers who would most likely buy the plan?

Open Discussion How to approach this problem?

2. Acquire

UCI has a number of datasets related to machine learning. We will leverage the Bank Marketing dataset. Look into this link for more information https://archive.ics.uci.edu/ml/datasets/Bank+Marketing

Load the train and test datasets



In [1]:

    
train <- 
test <-

Attribute Information:

Input variables:

bank client data:

age (numeric)
job : type of job (categorical: 'admin.','blue-collar','entrepreneur','housemaid','management','retired','self-employed','services','student','technician','unemployed','unknown')
marital : marital status (categorical: 'divorced','married','single','unknown'; note: 'divorced' means divorced or widowed)
education (categorical: 'basic.4y','basic.6y','basic.9y','high.school','illiterate','professional.course','university.degree','unknown')
default: has credit in default? (categorical: 'no','yes','unknown')
housing: has housing loan? (categorical: 'no','yes','unknown')
loan: has personal loan? (categorical: 'no','yes','unknown')

contact: contact communication type (categorical: 'cellular','telephone')
month: last contact month of year (categorical: 'jan', 'feb', 'mar', ..., 'nov', 'dec')
day_of_week: last contact day of the week (categorical: 'mon','tue','wed','thu','fri')
duration: last contact duration, in seconds (numeric). Important note: this attribute highly affects the output target (e.g., if duration=0 then y='no'). Yet, the duration is not known before a call is performed. Also, after the end of the call y is obviously known. Thus, this input should only be included for benchmark purposes and should be discarded if the intention is to have a realistic predictive model.

other attributes:

campaign: number of contacts performed during this campaign and for this client (numeric, includes last contact)
pdays: number of days that passed by after the client was last contacted from a previous campaign (numeric; 999 means client was not previously contacted)
previous: number of contacts performed before this campaign and for this client (numeric)
poutcome: outcome of the previous marketing campaign (categorical: 'failure','nonexistent','success')

emp.var.rate: employment variation rate - quarterly indicator (numeric)
cons.price.idx: consumer price index - monthly indicator (numeric)
cons.conf.idx: consumer confidence index - monthly indicator (numeric)
euribor3m: euribor 3 month rate - daily indicator (numeric)
nr.employed: number of employees - quarterly indicator (numeric)

Output variable (desired target):

y - has the client subscribed a term deposit? (binary: 'yes','no')

3. Explore

Exercise



In [2]:

    
#Find frequency of deposit



In [6]:

    
# Train:
print("Train")
#Code here
print("Test")
#Code here









    



[1] "Train"

   no   yes 
31092  4119 
[1] "Test"

  no  yes 
8830 1170



In [7]:

    
# Find number of rows and columns in train and test
print("Train")
#Code here
print("Test")
#Code here









    



[1] "Train"






    Out[7]:





	35211
	17








    



[1] "Test"






    Out[7]:





	10000
	17



In [9]:

    
# Find names of the features in train and test
print("Train")
#Code here
print("Test")
#Code here









    



[1] "Train"






    Out[9]:





	'age'
	'job'
	'marital'
	'education'
	'default'
	'balance'
	'housing'
	'loan'
	'contact'
	'day'
	'month'
	'duration'
	'campaign'
	'pdays'
	'previous'
	'poutcome'
	'deposit'








    



[1] "Test"






    Out[9]:





	'age'
	'job'
	'marital'
	'education'
	'default'
	'balance'
	'housing'
	'loan'
	'contact'
	'day'
	'month'
	'duration'
	'campaign'
	'pdays'
	'previous'
	'poutcome'
	'deposit'

Exercise

Plot Deposit
Plot Deposit Vs Age
Plot Deposit Vs another feature
Plot any two features against each other



In [ ]:



In [ ]:



In [ ]:



In [ ]:

4. Refine



In [13]:

    
#Print the head of train
head(train)









    Out[13]:





age job marital education default balance housing loan contact day month duration campaign pdays previous poutcome deposit

	1 58 management married tertiary no 2143 yes no unknown 5 may 261 1 -1 0 unknown no
	2 44 technician single secondary no 29 yes no unknown 5 may 151 1 -1 0 unknown no
	3 33 entrepreneur married secondary no 2 yes yes unknown 5 may 76 1 -1 0 unknown no
	4 47 blue-collar married unknown no 1506 yes no unknown 5 may 92 1 -1 0 unknown no
	5 33 unknown single unknown no 1 no no unknown 5 may 198 1 -1 0 unknown no
	6 28 management single tertiary no 447 yes yes unknown 5 may 217 1 -1 0 unknown no

Exercise



In [7]:

    
#Print the head of test dataset









    Out[7]:





age job marital education default balance housing loan contact day month duration campaign pdays previous poutcome deposit

	1 38 self-employed single secondary no 677 yes no cellular 14 may 114 2 -1 0 unknown no
	2 58 blue-collar married primary no 5445 yes no cellular 14 apr 391 1 -1 0 unknown no
	3 55 retired married secondary no 5 no no unknown 20 jun 108 1 -1 0 unknown no
	4 26 management single secondary no 63 no no cellular 28 jul 76 4 -1 0 unknown no
	5 48 technician divorced tertiary no 907 no yes cellular 4 aug 103 1 -1 0 unknown no
	6 33 technician married tertiary no 525 yes yes unknown 28 may 139 1 -1 0 unknown no

Find the class of each of the columns in train dataset



In [3]:

    
lapply(train, class)









    Out[3]:





	$age
		'integer'
	$job
		'factor'
	$marital
		'factor'
	$education
		'factor'
	$default
		'factor'
	$balance
		'integer'
	$housing
		'factor'
	$loan
		'factor'
	$contact
		'factor'
	$day
		'integer'
	$month
		'factor'
	$duration
		'integer'
	$campaign
		'integer'
	$pdays
		'integer'
	$previous
		'integer'
	$poutcome
		'factor'
	$deposit
		'factor'

Exercise



In [11]:

    
#Find the classes for the features in test dataset. Does it match the train dataset?



In [ ]:

Exercise



In [14]:

    
#Find if train and test have missing values



In [15]:

    
#Does train has missing values?



In [ ]:



In [16]:

    
#Does test has missing values?



In [ ]:



In [19]:

    
#Find column names of train and test









    Out[19]:





	'age'
	'job'
	'marital'
	'education'
	'default'
	'balance'
	'housing'
	'loan'
	'contact'
	'day'
	'month'
	'duration'
	'campaign'
	'pdays'
	'previous'
	'poutcome'
	'deposit'



In [14]:

    
# Find summary statistics of train









    Out[14]:





      age                 job           marital          education    
 Min.   :18.00   blue-collar:7560   divorced: 4086   primary  : 5315  
 1st Qu.:33.00   management :7382   married :21161   secondary:18053  
 Median :39.00   technician :5926   single  : 9964   tertiary :10406  
 Mean   :40.97   admin.     :4059                    unknown  : 1437  
 3rd Qu.:48.00   services   :3186                                     
 Max.   :95.00   retired    :1788                                     
                 (Other)    :5310                                     
 default        balance       housing      loan            contact     
 no :34571   Min.   : -8019   no :15663   no :29573   cellular :22817  
 yes:  640   1st Qu.:    71   yes:19548   yes: 5638   telephone: 2274  
             Median :   447                           unknown  :10120  
             Mean   :  1356                                            
             3rd Qu.:  1418                                            
             Max.   :102127                                            
                                                                       
      day           month          duration         campaign     
 Min.   : 1.0   may    :10751   Min.   :   0.0   Min.   : 1.000  
 1st Qu.: 8.0   jul    : 5398   1st Qu.: 103.0   1st Qu.: 1.000  
 Median :16.0   aug    : 4902   Median : 180.0   Median : 2.000  
 Mean   :15.8   jun    : 4123   Mean   : 258.2   Mean   : 2.759  
 3rd Qu.:21.0   nov    : 3126   3rd Qu.: 319.0   3rd Qu.: 3.000  
 Max.   :31.0   apr    : 2249   Max.   :4918.0   Max.   :63.000  
                (Other): 4662                                    
     pdays          previous           poutcome     deposit    
 Min.   : -1.0   Min.   :  0.0000   failure: 3803   no :31092  
 1st Qu.: -1.0   1st Qu.:  0.0000   other  : 1425   yes: 4119  
 Median : -1.0   Median :  0.0000   success: 1184              
 Mean   : 40.1   Mean   :  0.5827   unknown:28799              
 3rd Qu.: -1.0   3rd Qu.:  0.0000                              
 Max.   :871.0   Max.   :275.0000



In [15]:

    
#Find summary statistics of test









    Out[15]:





      age                 job           marital         education    default   
 Min.   :18.00   blue-collar:2172   divorced:1121   primary  :1536   no :9825  
 1st Qu.:33.00   management :2076   married :6053   secondary:5149   yes: 175  
 Median :39.00   technician :1671   single  :2826   tertiary :2895             
 Mean   :40.83   admin.     :1112                   unknown  : 420             
 3rd Qu.:48.00   services   : 968                                              
 Max.   :89.00   retired    : 476                                              
                 (Other)    :1525                                              
    balance      housing     loan           contact          day       
 Min.   :-2827   no :4418   no :8394   cellular :6468   Min.   : 1.00  
 1st Qu.:   75   yes:5582   yes:1606   telephone: 632   1st Qu.: 8.00  
 Median :  450                         unknown  :2900   Median :16.00  
 Mean   : 1385                                          Mean   :15.82  
 3rd Qu.: 1448                                          3rd Qu.:21.00  
 Max.   :66721                                          Max.   :31.00  
                                                                       
     month         duration         campaign         pdays       
 may    :3015   Min.   :   0.0   Min.   : 1.00   Min.   : -1.00  
 jul    :1497   1st Qu.: 102.0   1st Qu.: 1.00   1st Qu.: -1.00  
 aug    :1345   Median : 180.0   Median : 2.00   Median : -1.00  
 jun    :1218   Mean   : 258.1   Mean   : 2.78   Mean   : 40.53  
 nov    : 844   3rd Qu.: 317.0   3rd Qu.: 3.00   3rd Qu.: -1.00  
 apr    : 683   Max.   :3785.0   Max.   :58.00   Max.   :805.00  
 (Other):1398                                                    
    previous          poutcome    deposit   
 Min.   : 0.0000   failure:1098   no :8830  
 1st Qu.: 0.0000   other  : 415   yes:1170  
 Median : 0.0000   success: 327             
 Mean   : 0.5721   unknown:8160             
 3rd Qu.: 0.0000                            
 Max.   :35.0000



In [17]:

    
#Find correlation between age and deposit in train dataset









    Out[17]:




0.0351975918370962



In [20]:

    
#Find correlation between campaign and deposit in train dataset









    Out[20]:




-0.0747459602733383



In [21]:

    
#What is the standard deviation of age in train?









    Out[21]:




10.6511973326696



In [22]:

    
#What is the mean of age in train?









    Out[22]:




40.9651529351623



In [ ]:

5. Model

Logistic Regression



In [30]:

    
# Logit Function



In [44]:

    
x <- seq(0,1, length=100)
x <- x[2:(length(x)-1)]



In [45]:

    
logit <- function (t) {
log( t / (1-t) )
}



In [46]:

    
plot(x~logit(x), type="l")



In [47]:

    
inv_logit <- function(x){
    exp(x)/(1+exp(x))
}



In [51]:

    
y <- seq(-100,100, length=200)



In [50]:

    
plot(y~inv_logit(y), type="l")

Exercise

Plot the inverse logit for values between -3 and +3

Logistic Regression Equation

Compare this with linear regression



In [ ]:



In [52]:

    
#Running the model on train



In [54]:

    
model <- glm(deposit~., family=binomial(link="logit"), data=train)



In [55]:

    
summary(model)









    Out[55]:





Call:
glm(formula = deposit ~ ., family = binomial(link = "logit"), 
    data = train)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-5.7236  -0.3718  -0.2509  -0.1490   3.2350  

Coefficients:
                     Estimate Std. Error z value Pr(>|z|)    
(Intercept)        -2.631e+00  2.087e-01 -12.606  < 2e-16 ***
age                 1.394e-03  2.493e-03   0.559 0.576186    
jobblue-collar     -2.997e-01  8.245e-02  -3.635 0.000278 ***
jobentrepreneur    -3.095e-01  1.407e-01  -2.200 0.027827 *  
jobhousemaid       -4.539e-01  1.573e-01  -2.886 0.003898 ** 
jobmanagement      -1.602e-01  8.359e-02  -1.917 0.055285 .  
jobretired          2.954e-01  1.094e-01   2.700 0.006931 ** 
jobself-employed   -3.636e-01  1.308e-01  -2.781 0.005420 ** 
jobservices        -2.161e-01  9.625e-02  -2.246 0.024730 *  
jobstudent          4.002e-01  1.244e-01   3.216 0.001298 ** 
jobtechnician      -1.877e-01  7.863e-02  -2.387 0.016978 *  
jobunemployed      -1.492e-01  1.260e-01  -1.184 0.236536    
jobunknown         -2.792e-01  2.611e-01  -1.069 0.284963    
maritalmarried     -1.658e-01  6.683e-02  -2.480 0.013125 *  
maritalsingle       1.267e-01  7.625e-02   1.661 0.096696 .  
educationsecondary  1.903e-01  7.326e-02   2.598 0.009385 ** 
educationtertiary   3.622e-01  8.549e-02   4.236 2.27e-05 ***
educationunknown    2.786e-01  1.180e-01   2.361 0.018246 *  
defaultyes         -2.536e-02  1.889e-01  -0.134 0.893185    
balance             1.144e-05  5.908e-06   1.936 0.052854 .  
housingyes         -6.768e-01  5.005e-02 -13.522  < 2e-16 ***
loanyes            -4.496e-01  6.814e-02  -6.598 4.18e-11 ***
contacttelephone   -1.209e-01  8.449e-02  -1.431 0.152432    
contactunknown     -1.608e+00  8.249e-02 -19.491  < 2e-16 ***
day                 1.088e-02  2.815e-03   3.866 0.000111 ***
monthaug           -7.135e-01  8.897e-02  -8.019 1.07e-15 ***
monthdec            8.035e-01  1.967e-01   4.086 4.40e-05 ***
monthfeb           -1.560e-01  1.018e-01  -1.533 0.125232    
monthjan           -1.473e+00  1.445e-01 -10.193  < 2e-16 ***
monthjul           -8.461e-01  8.765e-02  -9.654  < 2e-16 ***
monthjun            4.940e-01  1.058e-01   4.671 3.00e-06 ***
monthmar            1.549e+00  1.362e-01  11.374  < 2e-16 ***
monthmay           -4.446e-01  8.192e-02  -5.427 5.72e-08 ***
monthnov           -9.546e-01  9.687e-02  -9.854  < 2e-16 ***
monthoct            8.909e-01  1.233e-01   7.227 4.95e-13 ***
monthsep            8.471e-01  1.342e-01   6.311 2.78e-10 ***
duration            4.198e-03  7.313e-05  57.406  < 2e-16 ***
campaign           -9.653e-02  1.170e-02  -8.251  < 2e-16 ***
pdays               1.646e-04  3.426e-04   0.480 0.630937    
previous            1.212e-02  7.443e-03   1.628 0.103554    
poutcomeother       2.233e-01  1.017e-01   2.196 0.028124 *  
poutcomesuccess     2.275e+00  9.375e-02  24.270  < 2e-16 ***
poutcomeunknown    -6.327e-02  1.059e-01  -0.598 0.550143    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 25413  on 35210  degrees of freedom
Residual deviance: 16708  on 35168  degrees of freedom
AIC: 16794

Number of Fisher Scoring iterations: 6



In [56]:

    
#Predict on test
test_prediction <- predict(model, test, type="response")



In [57]:

    
head(test_prediction)









    Out[57]:





	1
		0.0334536795812485
	2
		0.119641661275679
	3
		0.0553739182953198
	4
		0.0427454452591198
	5
		0.0382007472958081
	6
		0.00655807206116041



In [58]:

    
class(test_prediction)









    Out[58]:




'numeric'



In [59]:

    
library(ROCR)









    



Loading required package: gplots

Attaching package: ‘gplots’

The following object is masked from ‘package:stats’:

    lowess

Computing Error Metrics



In [64]:

    
# True Positive Rate
# False Positive Rate
# Area Under the Curve
# Precision
# Recall

src: https://en.wikipedia.org/wiki/Precision_and_recall



In [66]:

    
pr <- prediction(test_prediction, test$deposit)



In [67]:

    
prf <- performance(pr, measure = "tpr", x.measure = "fpr")



In [79]:

    
plot(prf)



In [69]:

    
#Area under curve
auc <- performance(pr, measure = "auc")



In [70]:

    
auc <- auc@y.values[[1]]



In [71]:

    
auc









    Out[71]:




0.902732138881628



In [80]:

    
#Precision and Recall
precision_recall <- performance(pr, "prec", "rec")
plot(precision_recall)

Reference

http://nlp.stanford.edu/manning/courses/ling289/logistic.pdf

	age	job	marital	education	default	balance	housing	loan	contact	day	month	duration	campaign	pdays	poutcome	deposit
1	58	management	married	tertiary	no	2143	yes	no	unknown	5	may	261	1	-1	unknown	no
2	44	technician	single	secondary	no	29	yes	no	unknown	5	may	151	1	-1	unknown	no
3	33	entrepreneur	married	secondary	no	2	yes	yes	unknown	5	may	76	1	-1	unknown	no
4	47	blue-collar	married	unknown	no	1506	yes	no	unknown	5	may	92	1	-1	unknown	no
5	33	unknown	single	unknown	no	1	no	no	unknown	5	may	198	1	-1	unknown	no
6	28	management	single	tertiary	no	447	yes	yes	unknown	5	may	217	1	-1	unknown	no

	age	job	marital	education	default	balance	housing	loan	contact	day	month	duration	campaign	pdays	poutcome	deposit
1	38	self-employed	single	secondary	no	677	yes	no	cellular	14	may	114	2	-1	unknown	no
2	58	blue-collar	married	primary	no	5445	yes	no	cellular	14	apr	391	1	-1	unknown	no
3	55	retired	married	secondary	no	5	no	no	unknown	20	jun	108	1	-1	unknown	no
4	26	management	single	secondary	no	63	no	no	cellular	28	jul	76	4	-1	unknown	no
5	48	technician	divorced	tertiary	no	907	no	yes	cellular	4	aug	103	1	-1	unknown	no
6	33	technician	married	tertiary	no	525	yes	yes	unknown	28	may	139	1	-1	unknown	no

Bank Marketing

1. Frame

2. Acquire

Attribute Information:

Input variables:

bank client data:

related with the last contact of the current campaign:

other attributes:

social and economic context attributes

Output variable (desired target):

3. Explore

4. Refine

Find the class of each of the columns in train dataset

5. Model

Logistic Regression

Logistic Regression Equation

Computing Error Metrics

Reference